Fast Checkpoint/Recovery to Support Kilo-Instruction Speculation and Hardware Fault Tolerance
Authors
Abstract
The increased relative cost of accessing memory is encouraging processor designers to explore deeper uniprocessor speculation (e.g., with branch and value prediction) and to consider multiprocessor speculation (e.g., on coherence message types and values). While some mechanisms have been proposed to support deep speculation using speculative multithreading, current mechanisms for conventional processors are less capable. To support kilo-instruction speculation with conventional processors, this paper proposes Multiversion Memory (MVM), a processor/memory interface that allows processors to create multiple versions of memory and recover to previous versions when necessary. In this paper, we develop an efficient implementation of MVM that uses a level one cache to keep recent speculative blocks (like a future file for memory), uses version buffers to keep old versions of blocks for which speculation is pending (like a memory history buffer), and leaves the level two cache (and beyond) unchanged (like a memory architectural file). Concurrently, requirements for highly available computers and manufacturing trends toward deep sub-micron design encourage techniques to mask transient faults (e.g., with error correcting codes and execution retry). Most current designs consider speculation and fault tolerance independently. Nevertheless, a second result of this paper is that MVM can provide support for both needs, perhaps making the use of hardware fault tolerance more widespread. Simple cost models with parameters from commercial workloads show that our implementation of MVM allows kilo-instruction speculation and fault tolerance that can recover faster (e.g., less than 273 vs. 362 cycles), uses recovery storage that is smaller (e.g., 5,356 bytes vs. 10,000 bytes), and has lower common-case overhead than other recently proposed schemes.
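The checkpoint/recover idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class name `MultiversionMemory`, its method names, the dictionary-based memory, and the default value 0 for unwritten addresses are all hypothetical. The current values stand in for the speculative state held in the level one cache, and each per-checkpoint undo log stands in for a version buffer holding old block values.

```python
_ABSENT = object()  # sentinel: the address held no value before the speculative store


class MultiversionMemory:
    """Toy model (hypothetical API): current state plus one undo log per open checkpoint."""

    def __init__(self):
        self.current = {}          # latest, possibly speculative, values
        self.version_buffers = []  # undo logs, oldest checkpoint first

    def checkpoint(self):
        # Open a new version; stores made from now on can be undone.
        self.version_buffers.append({})

    def store(self, addr, value):
        if self.version_buffers:
            log = self.version_buffers[-1]
            # Save the old value only on the first store to addr in this version.
            if addr not in log:
                log[addr] = self.current.get(addr, _ABSENT)
        self.current[addr] = value

    def load(self, addr):
        # Assumption: unwritten addresses read as 0.
        return self.current.get(addr, 0)

    def recover(self, checkpoints=1):
        # Roll back the most recent checkpoints, newest first.
        for _ in range(min(checkpoints, len(self.version_buffers))):
            log = self.version_buffers.pop()
            for addr, old in log.items():
                if old is _ABSENT:
                    self.current.pop(addr, None)
                else:
                    self.current[addr] = old

    def commit_oldest(self):
        # Speculation on the oldest version resolved: its old values are no longer needed.
        if self.version_buffers:
            self.version_buffers.pop(0)


mem = MultiversionMemory()
mem.store(0x10, 1)   # non-speculative store
mem.checkpoint()
mem.store(0x10, 2)   # speculative store in version 1
mem.checkpoint()
mem.store(0x10, 3)   # speculative store in version 2
mem.recover()        # misspeculation: undo version 2 only
value_after_recovery = mem.load(0x10)
```

In this model, recovery cost is proportional to the number of distinct addresses written since the checkpoint, which is why only the first store to each address per version is logged.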
Similar resources
Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support
The demand for fault-tolerant execution on high performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a softwa...
Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI
Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major...
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major...
Exploiting Value Prediction for Fault Tolerance
Technology scaling has led to growing concerns about reliability in microprocessors. Currently, fault tolerance techniques rely on explicit redundant execution for fault detection or recovery, which incurs significant performance, power, or hardware overhead. This paper makes the observation that value predictability is a low-cost (albeit imperfect) form of program redundancy that can be exploit...
FaulTM-multi: Fault Tolerance for Multithreaded Applications Running on Transactional Memory Hardware
Fault tolerance has become an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multiprocessors (CMP) provides high reliability since it can detect both transient and permanent faults and silent data corruptions. However, comparing the results of the instruction streams, checkpointing the entire sy...